

INTERIM REPORT
1013.1-1

QUANTIFICATION OF INFORMATION STORAGE
AND RETRIEVAL METHODOLOGIES


CONTRACT NUMBER N00014-70-C-0044

RESEARCH SPONSORED BY THE OFFICE OF NAVAL
RESEARCH UNDER NRL PROJECTS RF018-02-41 AND RR003-09-41-502


5  JUNE  1970 


ANALYTICS ,  INCORPORATED 
179  WASHINGTON  LANE 
JENKINTOWN, PENNSYLVANIA 19040


This document has been approved
for public release and sale; its
distribution is unlimited.

Reproduction in whole or in part
is permitted for any purpose of the
United States Government.




This Interim Report 1013.1-1 represents completion
of Phase 1 of work under contract N00014-70-C-0044
with the Office of Naval Research for the U.S. Naval
Research Laboratory. This report is composed of two papers
which represent the areas of concentration of this study
as directed by the Contract Scientific Officer and his
designated representatives.

ABSTRACT


This paper presents the results of a set of Monte
Carlo computations designed to show the general behavior
of the efficiency of probabilistic information retrieval
systems as a function of human-variability noise. The total
amount of noise, the combination of noise produced in indexing
documents and in formulating requests, is the independent
variable. The effect of noise is measured by the
fraction of the file that must be retrieved in order to obtain
the document that in the absence of noise would be
retrieved first. Computations are made for an idealized
system in which the index and request vectors are normalized
and have uniform distributions; however, the method could
accommodate other distributions. The results show how, for
a fixed amount of noise, the depth of file search decreases
with increasing numbers of index categories for each constant
ratio of terms specified in the query to index categories
in the space. Also, for a fixed number of index
categories, the way in which the fraction of file searched
decreases with the number of index terms in the query is
shown.


INTRODUCTION


Probabilistic indexing techniques as first introduced
by Maron and Kuhns [1] are capable of a wider
variety of responses than Boolean systems. In a probabilistic
retrieval system each document D_i is assigned an index
vector V_i whose elements quantify the degree to which each
index term describes the document. Likewise, request vectors
describe information needs in terms of the same index
space. The relevance of any document to a request R is
then a function of V_i and R, and the response to a request
is an ordering of the documents according to their relevance
to that request.

The probabilistic system allows the elements
v_ij of the index vector V_i and the elements r_k of the
request vector R to take on any value in the range (0,1).
Thus a relevance measure such as r = V_i · R (suitably normalized)
associates relevance with the intuitive concept of distance
between document and request in Euclidean space. Stiles [2]
and Shumway [3] further expand the range of probabilistic
techniques by introducing clustering, the grouping of index
terms by statistical association of the indexed documents.
Jones and Needham [4] demonstrate a system based
upon matching request and document groups.

In the present paper, we consider the effect of
human-variability noise in the generation of index and request
vectors upon the efficiency of probabilistic retrieval
systems.


Definition of the Problem


Let D be an arbitrary document with index vector
V. Because of indexing noise, it is assigned the index
vector V^n instead. A user has an information need that is
exactly satisfied by the document D. He should therefore
express his need by the request vector R, where ideally
R = V, but because of request noise, he specifies R^n instead.
The retrieval system ranks each document D_i in the file
according to the value of the inner product V_i^n · R^n of its
index vector V_i^n with the request vector R^n. In the absence
of noise, the desired document D would have been ranked
first, but with noise present, it may well be outranked by
other documents. The purpose of the computations reported
here is to investigate how the ranking of D is affected
by the amount of noise as measured by the inner product

$$ r = R^n \cdot V^n \tag{1} $$

where both the index vector V^n and the request vector R^n are
normalized. The effect of indexing noise and request noise
is expressed by the departure of r from the noiseless-case
value of r = 1.
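
The ranking rule just described can be illustrated with a short Python sketch. It ranks a toy file by inner product and reports the position of the desired document with and without request noise. The function names (normalize, rank_of_target) and the three-term toy vectors are assumptions made for this illustration; they do not come from the report.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_of_target(index_vectors, target, request):
    """1-based rank of index_vectors[target] when every document is
    scored by the inner product of its normalized index vector with
    the normalized request vector."""
    q = normalize(request)
    scores = [inner(normalize(v), q) for v in index_vectors]
    r = scores[target]
    # Documents with a strictly larger score outrank the target.
    return 1 + sum(1 for s in scores if s > r)

# Hypothetical three-document file over three index terms.
file_vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
noiseless_request = [0.6, 0.8, 0.0]   # R = V for document 1
noisy_request = [0.9, 0.3, 0.1]       # an assumed noisy version of R

print(rank_of_target(file_vectors, 1, noiseless_request))
print(rank_of_target(file_vectors, 1, noisy_request))
```

In this toy case the desired document is ranked first for the noiseless request and drops to second once the request is perturbed.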

The  Nature  of  the  Noise 


Indexing noise is the variability in assigning
an index vector to a given document. To measure indexing
noise experimentally, give the same document to a number
of people for indexing, assume the mean to be the
correct index vector for the document, and observe the departures
of the individual index vectors from the mean.
This procedure gives variations about the mean and ignores
variations of the mean, which can be made small by using a
sufficiently large number of indexers. Similarly, request
noise exists because two people with the same need for information
do not always express the need by means of identical
request vectors.
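
A minimal sketch of the measurement procedure described above follows, assuming that the "departure" of an indexer's vector from the mean is expressed as the angle between the two vectors. That choice, and the names mean_vector, angle_between and indexing_noise, are assumptions made for illustration; the report itself presents no such measurements.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of the indexers' vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

def indexing_noise(vectors):
    """Angular departure of each indexer's vector from the mean vector,
    which is taken to be the correct index vector."""
    center = mean_vector(vectors)
    return [angle_between(v, center) for v in vectors]

# Two hypothetical indexers who disagree completely on a two-term index.
print(indexing_noise([[1.0, 0.0], [0.0, 1.0]]))
```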

This paper reports no measurements of either indexing
noise or request noise; instead, the depth of file
search is presented as a function of the noise, which is
measured by the angular distance between V^n and R^n.

Use of a Document-Generating Distribution

In a small file it is not important that the
probabilistic information retrieval system perform extremely
well. If it does not, a small number of documents are examined
unnecessarily. In a large file, the penalty for
poor performance is greater. The file size is therefore a
parameter affecting system effectiveness. To eliminate this
parameter, the set of documents is represented by a generating
distribution instead of a finite set. Rather than counting
the number of documents that outrank the desired document D,
that is, the number of D_i for which

$$ V_i^n \cdot R^n > V^n \cdot R^n = r, \tag{2} $$

the computation will estimate the probability that inequality
(2) is satisfied by a D_i, with index vector V_i^n, randomly
selected from the generating distribution.

Choice of the Document-Generating Distribution

The ranking of documents that the system produces
in response to a request vector is unaffected if the request
vector is multiplied by a positive scalar. Therefore, there is no
loss in generality in normalizing the request vectors so
that their Euclidean lengths, the square root of the sum
of the squares of the vector elements, are all unity. The
request vectors are assumed to be normalized in this manner.


The index vectors are also assumed to be normalized.
This assumption is not innocuous; it has physical
implications. It implies, for example, that every document
is equally worthy of retrieval, needing only the proper
request vector to make it the first-ranked document in the
response. In particular, this assumption implies that a
document dealing with a wide variety of subjects -- a handbook,
for example -- is not accorded greater or lesser prominence
in the retrieval system than a highly specialized
document. Under the assumption of normalization, the index
vectors may be represented geometrically in Euclidean n-space
as vectors emanating from the origin and with terminus
on the positive orthant (including boundary) of the unit
sphere -- n being, for the moment anyway, the number of
index terms. The number n is, like the noise level, a
parameter in the results presented in this paper.

It is further assumed that the index vectors are
uniformly distributed over the positive orthant of the unit
sphere. This is a powerful assumption, but not as drastic
as it first sounds, for the following reason. If n is the
total number of index terms in the system, as suggested
above, the assumption of an even distribution of index
vectors is totally unrealistic because it is known that
index terms commonly occur in clusters. But if the parameter
n is interpreted in the results as the number of index
terms in one of the clusters, the assumption is less objectionable
since a desirable selection of index terms within a
cluster is the selection giving uniform scope to each term.
The results under this interpretation show the fraction of
documents in the cluster that must be retrieved to reach
the document that would be retrieved first in a noiseless
system.

The assumptions of normalization and uniform distribution
on the index vectors are made for two reasons:
first, they provide simplification in the mathematics and
second, they do not conflict with what is known. If there
were good reason to believe in any other specific distribution
of the index vectors, simplicity of mathematics could
be sacrificed for the sake of realism, and computations
such as those reported here could be performed using the
more realistic distribution.

The  Computation  Problem 

As a result of the assumptions set forth above,
the problem of computing the efficiency of the retrieval
system has a simple geometric representation in n-dimensional
Euclidean space. Let S denote the n-dimensional unit sphere:
the set of points (x_1,...,x_n) for which

$$ x_1^2 + \cdots + x_n^2 = 1. \tag{3} $$

Let S+ represent the positive orthant of S, the set of
points satisfying (3) with

$$ x_j \ge 0, \quad j = 1, \ldots, n. \tag{4} $$

The assumption on the index vectors is that they are uniformly
distributed over S+. The x's are, of course, the
weights used in the index and request vectors.

Let O be the origin, the center of the unit sphere
S, and let P and Q denote any points on S. Then the angle
POQ is called the angular distance between P and Q.

Let M(θ/Q) denote the measure, the (n-1)-dimensional
"area", of those points that are both in S+ and within
angular distance θ of Q, where Q is an arbitrary point in
S+. If M(S+) is the measure of S+, then the ratio

$$ \frac{M(\theta/Q)}{M(S^+)} \tag{5} $$

is the fraction of documents, in the index-term cluster
represented by S, within angular distance θ of Q; it is the
fraction of the documents whose (inner product) relevance
number r with respect to a request vector Q is at least

$$ r = \cos\theta. $$

If Q is allowed to range with uniform distribution over S+,
the mean value of the ratio (5) is the average fraction of
documents that would have to be retrieved to ensure finding
all documents displaced by angular distance θ from the
position of the request vector as a result of noise effects.
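
The ratio (5) can also be estimated by direct sampling, which gives a rough cross-check on the geometric argument. The sketch below draws index vectors uniformly from S+ by normalizing the absolute values of Gaussian samples and counts those whose inner product with Q is at least cos θ. This direct method is an assumption made for illustration and is not the procedure of the Appendix; the names sample_positive_orthant and fraction_within are hypothetical.

```python
import math
import random

def sample_positive_orthant(n, rng):
    """Draw a point uniformly from the positive orthant of the unit
    n-sphere by normalizing absolute values of Gaussian samples."""
    x = [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]
    length = math.sqrt(sum(v * v for v in x))
    return [v / length for v in x]

def fraction_within(q, theta, samples=20000, seed=0):
    """Monte Carlo estimate of the fraction of S+ lying within angular
    distance theta of the unit vector q, i.e. with inner product at
    least cos(theta)."""
    rng = random.Random(seed)
    threshold = math.cos(theta)
    hits = 0
    for _ in range(samples):
        v = sample_positive_orthant(len(q), rng)
        if sum(a * b for a, b in zip(v, q)) >= threshold:
            hits += 1
    return hits / samples

q = [0.5, 0.5, 0.5, 0.5]   # a unit request vector with n = 4
for theta in (0.2, 0.4, 0.8):
    print(theta, fraction_within(q, theta))
```

Averaging fraction_within over many request vectors Q drawn from the same distribution would approximate the mean value of (5) discussed above.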

But uniform distribution of Q over S+ is unnecessarily
restrictive; therefore, another parameter k is introduced,
the number of non-zero index-term
weights in the request vector Q, where k < n. For example,
if n = 8 and k = 4, the request vectors are of the form
(x_1,...,x_n) where

$$ x_a^2 + x_b^2 + x_c^2 + x_d^2 = 1, \quad a \ne b \ne c \ne d, \tag{6} $$

and where a, b, c, and d are any four different selections
from the eight numbers (1,...,8), the other four weights being
zero. Since in the results it does not matter which k of the
n weights are non-zero, it will be assumed henceforth that
x_1,...,x_k are the non-zero weights and x_{k+1},...,x_n are zero.
In the computations reported here, the request vectors Q are
taken to be uniformly distributed over the positive orthant
of the k-dimensional sphere

$$ x_1^2 + \cdots + x_k^2 = 1. \tag{7} $$


Let T denote the distribution of Q-vectors just
described, and let M(θ/T) denote the mean of M(θ/Q) over T.
Then the ratio

$$ \frac{M(\theta/T)}{M(S^+)} \tag{8} $$

is the average fraction of documents that would have to
be retrieved, under the assumptions of the computation, to
reach a document displaced by noise effects through an angular
distance θ from an initial position coincident with the request
vector. Of course, if a more realistic distribution
T were known for the request vectors, it could be used in
the ratio (8) in place of the uniform distribution presently
used.

The Appendix presents the details of the computational
procedure used. The program to perform this computation
has not been included, but is available.

Results 

The results of the Monte Carlo computations are
presented as the curves of Figures 1 through 4. These
curves plot the fraction F(r) of the subfile that must be
searched against the relevance number

$$ r = \cos\theta \tag{9} $$

where

$$ F(r) = \frac{M(\theta/T)}{M(S^+)} \tag{10} $$

as in (8).


To summarize the meaning of the curves: if there
is a document D that exactly matches the user's information
needs but, because of indexing noise and request noise, the
actual index vector for D and the actual request vector are
at angular distance θ apart, F(r) is the fraction of the subfile
that must be searched, on the average, before the user
finds D. "Subfile" here means the portion of the file that
deals with the cluster of index terms into which D falls.
The parameters n and k are, respectively, the number of
index terms in the cluster and the number of those index
terms that occur with non-zero weight in the request vector,
k < n.

The Monte Carlo sample sizes used were inadequate
for reliable determination of F(r) for large θ; the portions
of the curves shown dotted are extrapolations.

The curves clearly show how, for fixed values of
the ratio k/n, the fraction of the file that need be searched
for a given value of θ decreases with increasing values of n,
the number of index terms in the cluster. However, it is to
be expected that the human indexing noise represented by θ
increases with n, and does so possibly fast enough to outweigh
the decrease shown for a fixed θ. Comparison among the
four figures shows how, for fixed n, the fraction of the file
that must be searched decreases with k, the number of index
terms used in the query; that is, with the degree of
specialization of the document within the cluster.


Figure 1


Summary


The experiment considered in this paper provides a
qualitative characterization of the efficiency of probabilistic
information retrieval systems. Although an
idealized system has been employed, the methodology presented
can be extended to actual systems and made highly
dependent on the particular properties of a real data base.
With the advent of non-Boolean retrieval methodologies
over the past decade, the need for such tools to aid in
evaluation has become apparent. It is hoped that extension
of this model to specific existing systems will be accomplished
in the near future.


Appendix - Computation of Fraction of Search


The Monte Carlo computations for F(r), equation
(10), are here briefly described.

The denominator M(S+) of (10) is (1/2)^n times the
measure of the (n-1)-dimensional "surface" of the n-dimensional
unit sphere or, for n = 2m (only even values of
n were used),

$$ M(S^+) = \frac{2^{-n} \, 2 \pi^m}{(m-1)!}, \quad n/2 = m \text{ integral}. \tag{11} $$
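
As a quick numerical check on the closed form for M(S+), the sketch below compares it with the general surface-measure formula 2π^(n/2)/Γ(n/2) for the unit sphere in n dimensions, scaled by (1/2)^n. The function names are assumptions made for this illustration.

```python
import math

def orthant_measure(n):
    """Measure of the positive orthant of the unit n-sphere for even
    n = 2m, using the closed form 2**(-n) * 2 * pi**m / (m - 1)!."""
    if n < 2 or n % 2:
        raise ValueError("n must be an even integer >= 2")
    m = n // 2
    return 2.0 ** (-n) * 2.0 * math.pi ** m / math.factorial(m - 1)

def orthant_measure_gamma(n):
    """Same quantity from the general sphere-surface formula
    2 * pi**(n/2) / Gamma(n/2), scaled by (1/2)**n."""
    return 2.0 ** (-n) * 2.0 * math.pi ** (n / 2.0) / math.gamma(n / 2.0)

for n in (2, 4, 8, 16):
    print(n, orthant_measure(n), orthant_measure_gamma(n))
```

For n = 2 the value is π/2, the length of a quarter of the unit circle.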


Only the numerator M(θ/T) need be discussed.

If Q is at an angular distance greater than θ
from the nearest point on the boundary of S+, then M(θ/Q)
of (5) becomes

$$ M(\theta/Q) = K \int_0^{\theta} \sin^{n-2}\theta \, d\theta, \tag{12} $$

where

$$ K = \frac{\pi^{m-1} \, 2^{n-1} \, (m-1)!}{(n-2)!}, \quad n/2 = m \text{ integral}. \tag{13} $$
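
The constant K can be checked numerically. Taking K to be the surface measure of the unit sphere one dimension lower, which for n = 2m works out to π^(m-1) 2^(n-1) (m-1)!/(n-2)!, the cap measure K ∫ sin^(n-2) integrated from 0 to π should recover the full surface of the n-dimensional unit sphere, 2π^m/(m-1)!. The sketch below performs that check with a simple midpoint rule; the function names and the step count are assumptions made for illustration.

```python
import math

def ring_constant(n):
    """K for even n = 2m: pi**(m-1) * 2**(n-1) * (m-1)! / (n-2)!."""
    m = n // 2
    return (math.pi ** (m - 1) * 2.0 ** (n - 1)
            * math.factorial(m - 1) / math.factorial(n - 2))

def cap_measure(theta, n, steps=10000):
    """Measure of the spherical cap of angular radius theta on the unit
    n-sphere, K * integral of sin**(n-2), by the midpoint rule."""
    h = theta / steps
    total = sum(math.sin((i + 0.5) * h) ** (n - 2) for i in range(steps))
    return ring_constant(n) * total * h

for n in (2, 4, 6):
    m = n // 2
    full_surface = 2.0 * math.pi ** m / math.factorial(m - 1)
    print(n, cap_measure(math.pi, n), full_surface)
```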


If Q is at an arbitrary distance from the boundary of S+,

$$ M(\theta/Q) = K \int_0^{\theta} f(\theta/Q) \, \sin^{n-2}\theta \, d\theta \tag{14} $$

where f(θ/Q) is the fraction of the "ring" at angular distance
θ from Q that lies in S+. Then the desired numerator in (10)
is

$$ M(\theta/T) = K \int_0^{\theta} \bar{f}(\theta/T) \, \sin^{n-2}\theta \, d\theta \tag{15} $$

where $\bar{f}(\theta/T)$ is the mean fraction of the ring at angular
distance θ from Q that lies in S+, averaged over the
distribution T for Q.


Equivalently, $\bar{f}(\theta/T)$ is the mean probability,
averaged over the distribution T, that a step of angular-distance
size θ from Q in a random direction will land in
S+. Also equivalently, $\bar{f}(\theta/T)$ is the probability that when
a Q is drawn from T and a random direction (uniformly
distributed) is selected along the surface of the unit
n-sphere at Q, the boundary of S+ is at an angular distance
of at least θ in that direction. Finally, if Q is drawn
from T and a random direction (uniformly distributed) is
selected along the surface of the n-sphere at Q and the
angular distance in that direction to the boundary of S+
is measured, the cumulative distribution function of the
measured angular distance is

$$ P(\theta) = 1 - \bar{f}(\theta/T), $$

or

$$ \bar{f}(\theta/T) = 1 - P(\theta), \tag{16} $$

where P(θ) is the cumulative distribution function of the
angular distance to the boundary as just defined. This
last interpretation of $\bar{f}(\theta/T)$ is the one used in the Monte
Carlo computations.

First the point (q_1,...,q_k) was drawn from a uniform
distribution over the positive orthant of the k-dimensional
sphere

$$ x_1^2 + \cdots + x_k^2 = 1 \tag{17} $$

as follows: each of the x = (x_1,...,x_k) was taken to be the
sum of six independent random numbers uniformly distributed
over an interval symmetric about zero. Each x_i was therefore
approximately a sample from a normal distribution with zero
mean. By the relevant property of normal distributions, the
set (x_1,...,x_k), regarded as the coordinates of a point in
k-space, was a sample from a k-dimensional normal distribution
with spherical symmetry. The normalization

$$ q_i = \frac{|x_i|}{(x_1^2 + \cdots + x_k^2)^{1/2}}, \quad i = 1, \ldots, k \tag{18} $$

gave the desired (q_1,...,q_k). Setting

$$ q_i = 0, \quad i = k+1, \ldots, n \tag{19} $$

gave the Q = (q_1,...,q_n) as defined in connection with
equations (6) and (7).
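
The sampling of Q described above might be sketched as follows. The interval (-1, 1) for the uniform random numbers is an assumption (any interval symmetric about zero gives a zero-mean sum), as are the use of absolute values to place the point in the positive orthant and the names approx_normal and sample_request.

```python
import math
import random

def approx_normal(rng):
    """Approximately normal, zero-mean sample: the sum of six
    independent uniform random numbers on (-1, 1) (assumed interval)."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(6))

def sample_request(n, k, rng):
    """Request vector Q with k non-zero weights drawn approximately
    uniformly from the positive orthant of the unit k-sphere, padded
    with n - k zeros."""
    x = [approx_normal(rng) for _ in range(k)]
    length = math.sqrt(sum(v * v for v in x))
    q = [abs(v) / length for v in x]   # normalize onto the positive orthant
    return q + [0.0] * (n - k)         # remaining weights are zero

rng = random.Random(1970)
print(sample_request(8, 4, rng))
```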

Next, a direction along the surface of the n-sphere
from Q might be found ("might" because a more economical
way is given below) by locating the point C as follows:
draw a point (x_1,...,x_n) from an n-dimensional normal
distribution with spherical symmetry by taking each x_i to be
the sum of six independent random numbers evenly distributed
over the same interval. Compute

$$ w = \sum_{i=1}^{n} x_i q_i, \tag{20} $$

and set

$$ c_i = \frac{x_i}{w}, \quad i = 1, \ldots, n. \tag{21} $$

QC is then tangent to the unit sphere and OQC is a right
angle, because, as may be verified,

$$ \sum_{i=1}^{n} q_i (c_i - q_i) = 0. \tag{22} $$

However, if the procedure of (20), (21) and (22) is followed
just as outlined, the angular distance from Q in the direction
of QC will be zero with probability 1 - 2^{-n+k}, because that
distance is zero if any of x_{k+1},...,x_n are negative. To
save computation, therefore, x_{k+1},...,x_n were constrained
to be positive by a method equivalent to using

$$ c_i = \frac{|x_i|}{w}, \quad i = k+1, \ldots, n \tag{23} $$

in place of part of (21). The compensation

$$ P(\theta) = 1 - 2^{-n+k} \, (1 - p^*(\theta)) \tag{24} $$

was applied for the distortion in the sampling represented
by (23), where p*(θ) was the observed cumulative distribution
function found using (23).

Using (16) and (24) in (15) gives:

$$ M(\theta/T) = K \, 2^{-n+k} \int_0^{\theta} (1 - p^*(\theta)) \, \sin^{n-2}\theta \, d\theta. \tag{25} $$


Returning to the problem of determining p*(θ),
the cumulative distribution function of the angular distance
θ from Q to the boundary of S+ in the direction QC, consider
the point X(t) = (x_1,...,x_n) where

$$ x_i = (1-t) q_i + t c_i, \quad i = 1, \ldots, n. \tag{26} $$

For t = 0, X(t) = Q and for t = 1, X(t) = C. The point X(t) moves
continuously from Q to C in the direction QC as t increases.
For large enough t, except in unlikely special cases that
need not be discussed here nor be guarded against in the
computations, at least one of the coordinates will pass through
zero. The smallest positive value of t for which this occurs
marks the point X on the line QC where that line leaves the
positive orthant of the space. This smallest positive value
is found by solving the n equations

$$ 0 = (1-t) q_i + t c_i, \quad i = 1, \ldots, n \tag{27} $$

for t and noting the smallest non-negative solution. Using
this value of t in (26) gives the coordinates of X. The
resulting value of θ is

$$ \theta = \arccos \frac{1}{(x_1^2 + \cdots + x_n^2)^{1/2}}, \tag{28} $$

because OQX is a right angle and

$$ \cos\theta = \frac{OQ}{OX}. \tag{29} $$
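
The exit-angle computation just described can be sketched in Python. Given a unit vector Q in S+ and a point C on the tangent plane at Q, the function finds where the line from Q through C first leaves the positive orthant and converts that point to an angular distance. The handling of coordinates with q_i = 0 and of lines that never leave the orthant is an interpretation added for this sketch, and the name exit_angle is hypothetical.

```python
import math

def exit_angle(q, c):
    """Angular distance from Q to the boundary of the positive orthant
    along the tangent direction QC, with X(t) = (1 - t) * Q + t * C."""
    t_exit = math.inf
    for qi, ci in zip(q, c):
        if qi > 0.0:
            if ci < qi:
                # Coordinate decreases and reaches zero at t = qi / (qi - ci).
                t_exit = min(t_exit, qi / (qi - ci))
        elif ci < 0.0:
            # A zero coordinate that immediately turns negative.
            return 0.0
    if math.isinf(t_exit):
        # Assumed limiting value when no coordinate ever reaches zero.
        return math.pi / 2
    x = [(1.0 - t_exit) * qi + t_exit * ci for qi, ci in zip(q, c)]
    length = math.sqrt(sum(v * v for v in x))
    # OQX is a right angle and OQ = 1, so cos(theta) = 1 / OX.
    return math.acos(min(1.0, 1.0 / length))

s = 1.0 / math.sqrt(2.0)
q = [s, s]                 # Q at 45 degrees on the unit circle
c = [s + s, s - s]         # C = Q + unit tangent (1, -1) / sqrt(2)
print(exit_angle(q, c))    # the boundary (the x-axis) is 45 degrees away
```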

One thousand independent sample values of θ
as in (28) were computed for each curve of F(r) versus r
in the plotted results. The values of θ were ordered and
numbered such that

$$ \theta_1 \le \theta_2 \le \cdots \le \theta_{1000}. \tag{30} $$

The quadrature of (25) was approximated by a sum
in the obvious way using 1000 terms, the j-th representing
the interval θ_{j-1} < θ < θ_j, with p*(θ) taken equal to
(j-1)/1000 over the interval. Together with (11), (25)
gives the result (10) for F(r), the fraction of the subfile
that must be searched.
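
One way to carry out the quadrature described above is sketched here. It takes the ordered exit angles as the sample points, uses the empirical value (j-1)/N for p* on the j-th interval, and includes the factor 2^(-(n-k)) that follows from combining (16) with (24). The midpoint evaluation of sin^(n-2) on each interval, the closed forms used for K and M(S+), and the name fraction_searched are assumptions made for this illustration, not the report's program.

```python
import math

def fraction_searched(theta, exit_angles, n, k):
    """Estimate F(r) at r = cos(theta) from sampled exit angles
    (observed with the last n - k tangent components forced positive)."""
    m = n // 2
    big_k = (math.pi ** (m - 1) * 2.0 ** (n - 1)
             * math.factorial(m - 1) / math.factorial(n - 2))
    orthant = 2.0 ** (-n) * 2.0 * math.pi ** m / math.factorial(m - 1)
    ordered = sorted(exit_angles)
    count = len(ordered)
    total = 0.0
    lower = 0.0
    for j, upper in enumerate(ordered):
        # On (lower, upper) the empirical distribution p* equals j / count.
        top = min(upper, theta)
        if top > lower:
            mid = 0.5 * (lower + top)
            total += (1.0 - j / count) * math.sin(mid) ** (n - 2) * (top - lower)
        lower = upper
        if upper >= theta:
            break
    return big_k * 2.0 ** (-(n - k)) * total / orthant

# Check case n = k = 2: exit angles are uniform on (0, pi/2), and the
# exact answer is (4/pi) * (theta - theta**2 / pi), i.e. 0.75 at pi/4.
angles = [(i + 0.5) * (math.pi / 2) / 1000 for i in range(1000)]
print(fraction_searched(math.pi / 4, angles, 2, 2))
```

The n = k = 2 case is used because the exact answer can be worked out by hand on the quarter circle, which makes it a convenient sanity check on the constants.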
This document describes a generalized model of an
information storage and retrieval system and an algorithm
for determining the depth to which a file must be searched
to overcome indexing and querying noise. To make the results
of the study relevant to real information retrieval
systems, the model will be provided with statistical parameters
taken from actual files.

The general model proposed here is a combination
of two specific models that have been used to describe
information retrieval systems, the Boolean model and the
probabilistic model. These two models will be presented in
order to demonstrate the general model.

The  Boolean  Model 

In the Boolean model, a file in an information
retrieval system is represented by an m-dimensional space,
in which each dimension represents an index term used in
the file. The total number of index terms used in the
file is therefore m. Let the sequence t_1, t_2, ..., t_m be
an ordering of the terms of the file.

A document D_i, stored in the file, is represented
by the vector V_i = (a_i1, a_i2, ..., a_im) where each element a_ij
will have the value one if D_i is indexed by term t_j, and
zero if not.


Since a document may be indexed by any subset of
the m terms of the file, the documents are potentially distributed
throughout all of the 2^m points which represent the
corners of the m-dimensional unit hypercube.


For consistency with previous work, all the documents
are assumed to be equally worthy of retrieval given
only that the correct question is asked; therefore, each
document vector is normalized to length one.
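
A small sketch of the Boolean representation follows: a document becomes a 0/1 vector over the ordered terms of the file, which is then normalized to length one. The vocabulary, the document, and the function name boolean_vector are hypothetical and serve only to illustrate the construction.

```python
import math

def boolean_vector(doc_terms, vocabulary):
    """Normalized Boolean index vector: element j is non-zero exactly
    when term t_j of the ordered vocabulary indexes the document."""
    raw = [1.0 if term in doc_terms else 0.0 for term in vocabulary]
    length = math.sqrt(sum(raw))   # equals the Euclidean length for 0/1 entries
    if length == 0.0:
        raise ValueError("document has no index terms in the vocabulary")
    return [a / length for a in raw]

vocabulary = ["indexing", "noise", "retrieval", "sampling"]   # m = 4
print(boolean_vector({"indexing", "retrieval"}, vocabulary))
```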


The  Probabilistic  Model 


In the probabilistic model, the file is described
by the same m-dimensional hyperspace. A document D_i is
represented by the same vector V_i, except that each a_ij now
represents the probability that D_i will satisfy a request
containing the term t_j, where 0 ≤ a_ij ≤ 1 for all 1 ≤ j ≤ m. It is
assumed that these probabilities are completely known. Assume,
for example, that the probabilities are generated by finding
each term t_j that occurs in the text of D_i and assigning
it some relevance number a_ij (≤ 1). Then, for each of the
other terms t_k which do not appear in D_i, the number a_ik ≤ 1
is calculated both from co-occurrence data taken from a
very large sample of text and from previously assigned
values of a_ij. Under these conditions, the term a_ij will
take on the value zero with the same probability that it
takes on any other real value in the open interval (0,1).


The documents will then be distributed in the
m-dimensional hyperspace with no finite density of documents
occurring in any proper subspace of this space. As before,
the length of each document vector V_i may be normalized so
that the document values are considered to be distributed
on the surface of the positive orthant of the m-dimensional
hypersphere (positive orthant since the a_ij's are non-negative).
They will not be evenly distributed on this
surface but will instead cluster close to the border of
each of the subspaces. This results from many terms t_j
having very small relevance to particular documents D_i
and the corresponding a_ij's being close to zero.

The Combined Model

Merging the two models, one obtains the general
model for probabilistically indexed files. In the combined
model the documents are similarly represented by the
normalized vector V_i whose terms a_ij have values 0 ≤ a_ij ≤ 1.
Based upon this assignment, there is a finite distribution
of documents in each subspace of the m-space. The documents
are distributed on the surface of the n-dimensional hypersphere
of each n-dimensional subspace of the m-space.

The algorithm developed measures the amount of
the file that falls within the solid angle b of a query
vector. In making this measurement, the effects of clustering
upon the distribution of documents in the m-space that
are known to occur when documents are assigned index terms
must be taken into account.

In the general model, the clustering effects are
represented by the varying density of documents populating
subspaces. For the moment, assume that the distribution of
documents in each subspace is known from data taken from
an actual file. In order to save computation, we would like
to analyze only a part of the file to obtain results applicable
to the entire file. We might begin by selecting one cluster
(that is, a relatively heavily populated subspace) and work
within it, ignoring the remainder of the file. This approach
does not reflect the clustering properties, because no matter
what subspace is chosen there will be a significant population
of documents which cannot be completely described by
the n index terms of the cluster and yet are relevant to
it. As an example of this problem consider a document for which
all n terms of the cluster are relevant, plus one more.

The results of such an analysis determine the
depth to which the cluster must be searched. This depth is
not a useful result unless the size of the cluster relative
to the entire file can be measured, and unless it is assured
that the document sought will be in the cluster. In general,
this will not be true. Since, then, the environment of a
cluster must be considered in order to determine the effects
of the cluster, the isolation of an n-dimensional subspace
representation of the cluster in the development of the
model is not sufficient.

It will be more useful to base the calculation on
the set of index terms that result from a query. This set
may be equivalent to some meaningful statistical cluster of
index terms in the data base, it may be a subset of it, or
it may be somewhat different from it.

The  Approach  to  the  Analysis  Problem 

The approach taken is to first generate the query
vector and to concern ourselves with the distribution of
documents in the file immediately surrounding the query vector.
The distribution of document vectors throughout the file is
generated next, and the document density in the subspaces
surrounding the query vector is recorded. The document
vectors will be random variables whose distributions are
compiled from statistics taken from a real file. No attempt
will be made to isolate or identify the clusters, but the
distribution from which the documents are generated will
ensure that clustering effects are present.

The  procedure  is  composed  of  three  steps.  First, 
some  statistics  showing  the  actual  distribution  of  index 
terms  and  their  interrelations  must  be  obtained  for  input 
to  the  model.  Secondly,  after  the  query  vector  has  been 
generated,  the  fraction  of  the  file  specified  by  that  query 
vector  must  be  determined.  How  this  fraction  of  the  file 
is  distributed  throughout  the  subspaces  of  the  space  iden¬ 
tified  by  the  query  vector  must  also  be  found.  Finally, 
beginning  with  the  query  vector,  the  fraction  of  the  subfile 
encompassed  within  the  solid  angle  b  of  the  query  vector 
will  be  determined  as  a  function  of  b.  This  last  step  of 
the  work  is  similar  to  the  existing  model  (  ref.  previous 
paper)  except  that  the  search  will  not  be  limited  to  one 
surface  in  n-space.  Instead  the  search  will  be  extended 
to each of the n surfaces of dimension n-1, then to each
of the n(n-1)/2 surfaces of dimension n-2, and so on,
until  it  encompasses  the  entire  set  of  all  subspaces  of  the 
query  vector  space. 

The  three  steps  are  described  in  detail  in  the 
following paragraphs.

Step  I.  We  will  first  consider  the  size  of  the 
subfile that is implicated by a randomly-chosen query vector,
that  is,  what  portion  of  the  total  file  would  be  obtained 
if  every  document  were  retrieved  whose  set  of  index  terms 
had  any  term  in  common  with  the  terms  in  the  query  vector's 
set.  For  small  files,  the  portion  retrieved  is  dependent 
upon the size of the file: as the size of the file increases,
the total number of index terms increases proportionally,
and the portion of the file represented by a single term
decreases. On the other hand, for larger files the total
number  of  index  terms  tends  to  remain  constant  as  the  file 
size  varies.  Therefore  the  statistics  used  in  this  model 
will  be  taken  from  a  reasonably  large  file,  and  since  the 
results  will  be  expressed  as  fractions  of  total  file,  they 
will  be  applicable  to  all  large  files.  It  is  assumed  that 
the document distributions, normalized by file size, remain
the  same  for  all  large  files. 

A  query  vector  is  composed  of  n  randomly-selected 
terms.  To  determine  the  size  of  the  subfile  implicated  by 
these  n  terms  the  model  must  contain  some  indication  of  the 
number  of  documents  in  the  file  that  have  been  indexed  by 
exactly  this  combination  of  n  terms,  and  the  number  which 
have been exactly indexed by each of the $2^n$ subsets of this set
of  n  terms.  This  implies  that  the  distribution  of  documents 
throughout  every  combination  of  the  total  number  of  index 
terms  in  the  file,  m,  must  be  known.  For  moderate  size 
files,  m  will  range  from  200  to  1000  terms.  The  model  must 
then contain $2^{200}$ to $2^{1000}$ data items, thus making files of
this  size  far  too  large  to  be  considered. 
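To give a sense of scale, the short calculation below counts the decimal digits of $2^m$ for the two vocabulary sizes quoted above. It is only an arithmetic illustration; the function name `combination_count` is introduced here and does not come from the report.

```python
def combination_count(m: int) -> int:
    """Number of distinct subsets of a vocabulary of m index terms."""
    return 2 ** m

for m in (200, 1000):
    digits = len(str(combination_count(m)))
    print(f"m = {m}: 2^m has {digits} decimal digits")
```

Even the smaller vocabulary gives a count with dozens of digits, which is why a table over every term combination cannot be stored.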

An  approximation  to  the  total  amount  of  infor¬ 
mation  contained  in  a  full  description  of  the  document  dis¬ 
tributions  throughout  the  total  file's  m-space  can  be  derived 
from  the  record  of  the  distribution  of  the  documents  over 
the index terms taken one at a time and two at a time. These
statistics  are  available  from  many  reports  on  information 
retrieval  systems.  It  is  assumed,  then,  that  the  file  is 
described by a list of m terms giving the absolute probabilities
of the appearance of an index term $t_i$ in a document,

$$P(t_i) = \frac{\text{Number of documents indexed by } t_i}{\text{Total number of documents}}$$

and by another list of m(m-1)/2 terms, giving the probability
of co-occurrence of all pairs of index terms, $t_i$ and $t_j$,
in a document,

$$P(t_i t_j) = \frac{\text{Number of documents indexed by } t_i \text{ and } t_j}{\text{Total number of documents}}$$


Since  these  fractions  are  indications  of  the  fre¬ 
quency  of  use  of  the  basis  vectors  of  the  m-space,  in¬ 
dividually  and  in  pairs,  they  indicate  the  directions  taken 
by document vectors in m-space. An indication of the
length of the document vectors is given by the distribution
of  the  number  of  index  terms  per  document  in  the  file.  This 
discrete  probability  distribution,  called  N(x),  is  obtained 
by  sampling  the  number  of  index  terms  assigned  to  documents 
in  a  real  file. 
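The three inputs named in Step I (the single-term probabilities, the pairwise co-occurrence probabilities, and the distribution N(x) of terms per document) can be tabulated from an indexed file in one pass. The following is a minimal sketch under stated assumptions: the function name `file_statistics`, the representation of a document as a Python set of terms, and the four-document toy file are all hypothetical and are not taken from the report.

```python
from collections import Counter
from itertools import combinations

def file_statistics(documents):
    """Tabulate P(t_i), P(t_i t_j) and N(x) from a list of term sets."""
    total = len(documents)
    single = Counter()   # documents indexed by each term
    pair = Counter()     # documents indexed by each pair of terms
    length = Counter()   # documents having x index terms
    for terms in documents:
        single.update(terms)
        pair.update(frozenset(p) for p in combinations(sorted(terms), 2))
        length[len(terms)] += 1
    p_single = {t: c / total for t, c in single.items()}
    p_pair = {p: c / total for p, c in pair.items()}
    n_dist = {x: c / total for x, c in length.items()}
    return p_single, p_pair, n_dist

# Hypothetical toy file of four indexed documents.
toy_file = [
    {"radar", "antenna"},
    {"radar", "sonar", "signal"},
    {"sonar"},
    {"radar", "signal"},
]
p_single, p_pair, n_dist = file_statistics(toy_file)
print(p_single["radar"], p_pair[frozenset({"radar", "signal"})], n_dist[2])
```

Pairs are stored as unordered sets, so the pair table holds at most m(m-1)/2 entries, matching the size of the second list described above.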


Step II. After a random query vector has been
generated, the density of documents falling into the
n-dimensional subspace identified by the n-term query vector
is calculated. This density must include documents whose
total description vector lies outside the query subspace,
but  which  have  some  terms  in  common  with  the  query.  Each 
such  document  is  projected  into  the  k-dimensional  subspace 
of  the  query  space  where  k  is  the  number  of  terms  the  docu¬ 
ment  and  query  have  in  common. 

The  algorithm  for  generating  document  vectors  is 
as  follows: 

(1) A random number x is generated, and the distribution
N(x) is used to determine the number
of terms T by which the document is defined.

(2) The list of probabilities $P(t_i)$ (i = 1, ..., m) is
used as a second distribution to select a
particular index term. Assume term $t_{a_1}$
is chosen.

If T>1, a second index term is needed to describe
the document. The conditional probabilities
of selecting term $t_j$ given that
$t_i$ has been selected can be calculated from
the list of co-occurrence probabilities:

$$P(t_j/t_i) = P(t_i t_j)/P(t_i).$$

Therefore, the list of probabilities $P(t_i/t_{a_1})$
is used as a distribution to select a second
index term. Assume term $t_{a_2}$ is selected as
the second term.


If T>2, a third index term is needed. It
should be selected from the conditional probability
$P(t_k/t_i t_j)$; however, this is not
available. It may be approximated by the
geometric mean of the conditional probabilities
$P(t_i/t_{a_1})$ and $P(t_i/t_{a_2})$ that are available from
the co-occurrence probabilities. The list

$$P(t_i/t_{a_1} t_{a_2}) \approx \sqrt{P(t_i/t_{a_1})\,P(t_i/t_{a_2})}$$

is used as a distribution to select the third
term.


Each additional index term, up to T, is selected
by a probability distribution derived from the
conditional probabilities given the previously
selected terms. The above approximation is
generalized: for the (k+1)st term, the geometric
mean is taken over the conditional probabilities
given each of the k previously selected terms.
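The generation procedure above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' program: the function name `generate_document` and the toy statistics are hypothetical, the generalization to later terms is assumed to be the k-th root of the product of the pairwise conditional probabilities, terms are assumed not to repeat within a document, and the unnormalized weights are rescaled by `random.choices`.

```python
import random

def generate_document(n_dist, p_single, p_pair, rng=random):
    """Draw one document (a set of index terms) from the file statistics.

    n_dist   : {x: N(x)}, distribution of the number of terms per document
    p_single : {term: P(t_i)}
    p_pair   : {frozenset({t_i, t_j}): P(t_i t_j)}
    """
    vocabulary = sorted(p_single)
    # Step (1): draw the number of terms T from N(x).
    sizes = sorted(n_dist)
    T = rng.choices(sizes, weights=[n_dist[x] for x in sizes])[0]
    T = min(T, len(vocabulary))
    if T == 0:
        return set()
    # Step (2): draw the first term with weights P(t_i).
    # random.choices rescales the weights, so they need not sum to one.
    first = rng.choices(vocabulary, weights=[p_single[t] for t in vocabulary])[0]
    chosen = [first]
    while len(chosen) < T:
        candidates = [t for t in vocabulary if t not in chosen]
        weights = []
        for t in candidates:
            product = 1.0
            for a in chosen:
                # Conditional probability P(t/a) = P(t a) / P(a).
                product *= p_pair.get(frozenset((t, a)), 0.0) / p_single[a]
            # Geometric mean over the terms chosen so far (assumed generalization).
            weights.append(product ** (1.0 / len(chosen)))
        if sum(weights) == 0.0:
            # Assumption: stop early when no candidate co-occurs
            # with every term already chosen.
            break
        chosen.append(rng.choices(candidates, weights=weights)[0])
    return set(chosen)

# Hypothetical toy statistics, not taken from the report.
toy_n = {1: 0.25, 2: 0.5, 3: 0.25}
toy_single = {"antenna": 0.25, "radar": 0.75, "signal": 0.5, "sonar": 0.5}
toy_pair = {
    frozenset({"antenna", "radar"}): 0.25,
    frozenset({"radar", "sonar"}): 0.25,
    frozenset({"radar", "signal"}): 0.5,
    frozenset({"signal", "sonar"}): 0.25,
}
print(generate_document(toy_n, toy_single, toy_pair, random.Random(0)))
```

Because a candidate's weight is a product, any term that never co-occurs with one of the terms already chosen receives weight zero, so the generated documents inherit the clustering present in the pairwise statistics.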


For a particular query vector, the subspaces of
interest in the index-term space are identified as follows.
The n terms of the query vector are ordered; if a particular
subspace has the dimension corresponding to a particular
term, the binary form of its identifying number will have
a one in the position of that index term following the
ordering of the query vector terms, otherwise it will be zero.
Each subspace as used here does not include any of its
subspaces; thus the set of subspaces partitions the set of
documents.
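The identifying number described above amounts to a bit mask over the ordered query terms. The sketch below is illustrative only: the function name `subspace_number` is hypothetical, and it assumes that the first query term occupies the most significant of the n bit positions.

```python
def subspace_number(query_terms, document_terms):
    """Identifying number of the query subspace a document falls or projects into.

    query_terms    : ordered list of the n query terms
    document_terms : set of the document's index terms
    Assumption: the i-th query term (counting from 1) contributes 2**(n - i).
    """
    n = len(query_terms)
    number = 0
    for i, term in enumerate(query_terms, start=1):
        if term in document_terms:
            number += 2 ** (n - i)
    return number

query = ["radar", "sonar", "signal"]
print(subspace_number(query, {"radar", "signal", "optics"}))  # binary 101
```

A document sharing no term with the query maps to number 0, and one sharing all n terms maps to $2^n - 1$, so every document receives exactly one number.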


Each document vector generated according to this
procedure will either fall into one of the $2^n - 1$ subspaces
of the query vector space (not including the null space) or
it will be projected into one of the $2^n$ subspaces of the
query vector space. Therefore, $2^n$ counters, $C_0$ to $C_{2^n-1}$, will
be maintained and will count the number of documents which
either fall onto or are projected onto the surface of each
of the corresponding subspaces $S_0$ to $S_{2^n-1}$. The counter
corresponding to the zero-dimension subspace will count
all documents not implicated by the query.

Let  D  be  the  total  number  of  documents  generated. 

$C_0$ of the D documents have no index term in common with the
query, that is, they fall into the 0-dimension subspace. The
query then implicates a fraction $(D - C_0)/D$ of the total file.

Each subspace $S_i$ of the file will contain a density
of documents $D_i = C_i/D$ relative to the entire file.
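The counting in Step II can be sketched as below. This is a minimal illustration, assuming documents are given as sets of terms and that the first query term maps to the most significant bit of the subspace number; the names `subspace_number` and `count_subspaces` and the toy data are hypothetical.

```python
def subspace_number(query_terms, document_terms):
    """Bit-mask number of the query subspace (first query term = highest bit)."""
    n = len(query_terms)
    return sum(2 ** (n - i)
               for i, term in enumerate(query_terms, start=1)
               if term in document_terms)

def count_subspaces(query_terms, documents):
    """Return the counters, the implicated fraction and the densities."""
    n = len(query_terms)
    counters = [0] * (2 ** n)          # one counter per subspace number
    for doc in documents:
        counters[subspace_number(query_terms, doc)] += 1
    total = len(documents)             # D, the number of documents generated
    implicated = (total - counters[0]) / total
    densities = [c / total for c in counters]
    return counters, implicated, densities

# Hypothetical generated file of four documents and a two-term query.
docs = [{"radar", "antenna"}, {"radar", "sonar", "signal"}, {"sonar"}, {"optics"}]
counters, implicated, densities = count_subspaces(["radar", "sonar"], docs)
print(counters, implicated, densities)
```

The densities sum to one because each document increments exactly one counter, which is the partition property stated for the subspaces.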


Step III. In the last step of the procedure,
the portion of the document space that either falls onto
or is projected onto the surface subtended by the solid
angle b is calculated as a function of the angle b.

Let $b_i$ be the angular distance from the query
vector Q to the border of the i-th n-1 space, which is the
subspace identified by the subspace number $2^n - 1 - 2^{n-i}$. Let

$$b_{\min} = \min\{b_1, b_2, \ldots, b_n\}.$$


For $b < b_{\min}$, the part of the file encompassed
within angle b of Q is proportional to the surface area in
the first orthant in n-space subtended by b. We call this
quantity S(b,n). The total surface area in the first
orthant in n-space is S(n). Thus, the measure of the file
encompassed within angle b is

$$F(b) = \frac{S(b,n)}{S(n)} \cdot D_{2^n-1},$$

where $D_{2^n-1}$ is the density of documents on the n-space
surface.
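The report does not say how the ratio S(b,n)/S(n) is to be evaluated. One way to estimate it numerically is a Monte Carlo sample of directions spread uniformly over the first orthant of the unit sphere, counting those within angle b of Q. The sketch below is an assumption-laden illustration: the function name `orthant_fraction_within` is hypothetical, the query vector is assumed to have nonnegative components, and the uniform sampling is a choice made here, not the authors' method.

```python
import math
import random

def orthant_fraction_within(query, b, samples=20000, rng=None):
    """Estimate the fraction of the first-orthant surface within angle b of query."""
    rng = rng or random.Random(0)      # fixed seed so repeated calls agree
    n = len(query)
    norm_q = math.sqrt(sum(q * q for q in query))
    cos_b = math.cos(b)
    hits = 0
    for _ in range(samples):
        # |Gaussian| components, once normalized, are uniform on the orthant surface.
        v = [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]
        norm_v = math.sqrt(sum(x * x for x in v))
        cos_angle = sum(q * x for q, x in zip(query, v)) / (norm_q * norm_v)
        if cos_angle >= cos_b:
            hits += 1
    return hits / samples

q = [1.0, 1.0, 1.0]
for angle in (0.2, 0.5, math.pi / 2):
    print(angle, orthant_fraction_within(q, angle))
```

Multiplying the estimate by the density of the full n-dimensional subspace gives the measure of the file for angles inside the closest border; beyond that border the estimate still measures orthant surface but no longer equals the whole of F(b).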


Now, we will increase the solid angle b beyond
the border of the closest subspace. If $b_k = b_{\min}$, the
closest subspace will be the subspace with the k-th dimension
missing. Its identifying number will be

$$a = 2^n - 1 - 2^{n-k},$$

found by subtracting the one in the k-th bit position of the
binary-form identifying number.
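The selection of the closest border and its identifying number can be sketched as follows. The report does not give a formula for the angular distance to a border, so this sketch assumes it is the angle between Q and the coordinate hyperplane on which the i-th term is zero, that is, the arcsine of the i-th component of the normalized query vector. The function names `border_angles` and `closest_border` are hypothetical.

```python
import math

def border_angles(query):
    """Assumed angular distance from the query vector to each n-1 space border."""
    norm = math.sqrt(sum(q * q for q in query))
    return [math.asin(q / norm) for q in query]

def closest_border(query):
    """Return (b_min, k, a): smallest border angle, its index, and subspace number."""
    n = len(query)
    angles = border_angles(query)
    k = min(range(1, n + 1), key=lambda i: angles[i - 1])
    a = 2 ** n - 1 - 2 ** (n - k)      # clear the k-th bit of the all-ones number
    return angles[k - 1], k, a

b_min, k, a = closest_border([3.0, 1.0, 2.0])
print(b_min, k, format(a, "03b"))
```

With the hypothetical weights above, the second term has the smallest component, so the closest subspace is the one spanned by the first and third terms.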

Let $Q_a$ be the projection of Q into $S_a$. Let $b_a$
be the angle from $Q_a$ which subtends the same n-1 dimensional
surface as does the angle b. Then b and $b_a$ are related by:

$$\cos^2 b_k + \cos^2 b_a = \cos^2 b$$

Thus

$$b_a = \cos^{-1}\sqrt{\cos^2 b - \cos^2 b_k}$$

When b is increased beyond the border of the
closest subspace, but not as far as the next-closest border,
the depth of the file encompassed by b also includes the
contribution of the file in n-1 space that is subtended by
the angle $b_a$. This contribution is

$$\frac{S(b_a,n-1)}{S(n-1)} \cdot D_a.$$

Thus

$$F(b) = \frac{S(b,n)}{S(n)} \cdot D_{2^n-1} + \frac{S(b_a,n-1)}{S(n-1)} \cdot D_a.$$

As the angle b is expanded still further, it will
either reach the border of another n-1 space, or else the
angle $b_a$ will reach the border of an n-2 space.


For the first case, we identify

$$b_j = \min(b_1, b_2, \ldots, b_{k-1}, b_{k+1}, \ldots, b_n)$$

and the next subspace is

$$c = 2^n - 1 - 2^{n-j}.$$


For the second case, we measure the angular distance
of $Q_a$ to each of the (n-1) n-2 space borders. Let
$b_{a1}, \ldots, b_{an}$ be these angular distances, including an
arbitrarily large value for $b_{ak}$ (which now has no meaning)
in order to keep the same ordering.

Suppose $b_{ap} = \min\{b_{a1}, b_{a2}, \ldots, b_{an}\}$. Then the first
n-2 space encountered is identified by
$d = 2^n - 1 - 2^{n-k} - 2^{n-p} = a - 2^{n-p}$.

As angle b increases, it must be determined whether
$b_j$ or $b_{ap}$ will occur first.

Using the definition of $b_a$, this implies that if

$$b_j < \cos^{-1}\sqrt{\cos^2 b_{ap} - \cos^2 b_k},$$

the first case will occur first.


Determining the depth of file for the first case,
let $Q_c$ be the projection of Q into space $S_c$, and

$$b_c = \cos^{-1}\sqrt{\cos^2 b - \cos^2 b_j}.$$

The file encompassed by the angle b will include the
contribution in this subspace; thus

$$F(b) = \frac{S(b,n)}{S(n)} \cdot D_{2^n-1} + \frac{S(b_a,n-1)}{S(n-1)} \cdot D_a + \frac{S(b_c,n-1)}{S(n-1)} \cdot D_c.$$


Returning to the second case, let $Q_d$ be the projection
of $Q_a$ into the n-2 space d, and $b_{ad}$ be the angle
in space d subtending the same surface area as the angle b.

$$b_{ad} = \cos^{-1}\sqrt{\cos^2 b_a - \cos^2 b_{ap}}$$

The contribution to the file encompassed by the angle
b is

$$\frac{S(b_{ad},n-2)}{S(n-2)} \cdot D_d,$$

and the total file encompassed by the angle b is

$$F(b) = \frac{S(b,n)}{S(n)} \cdot D_{2^n-1} + \frac{S(b_a,n-1)}{S(n-1)} \cdot D_a + \frac{S(b_{ad},n-2)}{S(n-2)} \cdot D_d.$$


Ultimately, as angle b is increased sufficiently,
the last term from both of the above expressions for F(b)
will become included in the equation for F(b):

$$F(b) = \frac{S(b,n)}{S(n)} \cdot D_{2^n-1} + \frac{S(b_a,n-1)}{S(n-1)} \cdot D_a + \frac{S(b_{ad},n-2)}{S(n-2)} \cdot D_d + \frac{S(b_c,n-1)}{S(n-1)} \cdot D_c.$$

It  is  obvious  that  as  the  angle  b  is  increased 
further,  more  subspaces  will  be  encompassed  within  angle  b, 
and  F(b)  will  have  even  more  terms.  A  general  structure  is 
required  which  will  include  the  present  and  future  con¬ 
tributions  to  the  file  for  each  subspace.  The  possibility 
of  programming  the  computation  of  F(b)  in  a  recursive  pro¬ 
gram  is  obvious,  since  the  computation  of  the  contribution 
of  each  subspace  of  dimension  k  is  the  same  as  that  of  the 
subspace  of  dimension  k+1. 

Finally,  it  is  expected  that  a  significant  result 
for  the  F(b)  relation  will  be  obtained  long  before  all  the 
subspaces of the n-space are included. Therefore, an
approximation to this procedure will be obtained if only
the cases of subspaces of degree n-1 and n-2 are considered.
In  this  case,  it  is  not  necessary  for  the  program  which 
performs  the  calculations  to  be  recursive.  All  combinations 
that  may  be  required  for  calculations  within  subspaces  may 
be  represented  explicitly. 
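For the non-recursive approximation, the subspaces of degree n-1 and n-2 can be listed explicitly by clearing one or two bits of the all-ones identifying number. The sketch below is illustrative: the function name `near_full_subspaces` is hypothetical, and it assumes the numbering in which the i-th query term contributes $2^{n-i}$.

```python
from itertools import combinations

def near_full_subspaces(n):
    """Identifying numbers of the subspaces with one or two dimensions missing."""
    full = 2 ** n - 1
    # n subspaces of dimension n-1: clear the bit of one term.
    one_missing = [full - 2 ** (n - i) for i in range(1, n + 1)]
    # n(n-1)/2 subspaces of dimension n-2: clear the bits of two distinct terms.
    two_missing = [full - 2 ** (n - i) - 2 ** (n - j)
                   for i, j in combinations(range(1, n + 1), 2)]
    return one_missing, two_missing

one_missing, two_missing = near_full_subspaces(3)
print(one_missing, two_missing)
```

The two lists have n and n(n-1)/2 entries, so the table of contributions grows only quadratically with the number of query terms instead of as $2^n$.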